Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
The tasks of extracting (top-$K$) Frequent Itemsets (FI's) and Association
Rules (AR's) are fundamental primitives in data mining and database
applications. Exact algorithms for these problems exist and are widely used,
but their running time is hindered by the need to scan the entire dataset,
possibly multiple times. High-quality approximations of FI's and AR's are
sufficient for most practical uses, and a number of recent works explored the
application of sampling for fast discovery of approximate solutions to the
problems. However, these works do not provide satisfactory performance
guarantees on the quality of the approximation, due to the difficulty of
bounding the probability of under- or over-sampling any one of an unknown
number of frequent itemsets. In this work we circumvent this issue by applying
the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop
a novel technique for providing tight bounds on the sample size that guarantees
approximation within user-specified parameters. Our technique applies both to
absolute and to relative approximations of (top-$K$) FI's and AR's. The
resulting sample size is linearly dependent on the VC-dimension of a range
space associated with the dataset to be mined. The main theoretical
contribution of this work is a proof that the VC-dimension of this range space
is upper bounded by an easy-to-compute characteristic quantity of the dataset
which we call the \emph{d-index}: the maximum integer $d$ such that the
dataset contains at least $d$ transactions of length at least $d$, no one of
which is a superset of or equal to another. We show that this bound is
strict for a large class of datasets.
Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the
proceedings of ECML PKDD 2012.
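To make the \emph{d-index} concrete, below is a minimal Python sketch (an illustration, not the paper's procedure) of the easy upper bound obtained by dropping the antichain condition: the largest d such that at least d transactions have length at least d. By the standard $\epsilon$-approximation theorem for range spaces, a sample of size $O\big((d + \log(1/\delta))/\epsilon^2\big)$ then suffices with probability at least $1-\delta$.

def d_index_upper_bound(transactions):
    """Largest d such that at least d transactions have length >= d.
    Ignores the antichain (no superset/equality) condition, so it
    upper-bounds the d-index."""
    lengths = sorted((len(t) for t in transactions), reverse=True)
    d = 0
    for i, length in enumerate(lengths, start=1):
        if length >= i:
            d = i
        else:
            break
    return d

dataset = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "b"}, {"d"}]
print(d_index_upper_bound(dataset))  # 2: only two transactions have length >= 3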
Finding the True Frequent Itemsets
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It
requires identifying all itemsets appearing in at least a fraction $\theta$ of
a transactional dataset $\mathcal{D}$. Often though, the ultimate goal of
mining $\mathcal{D}$ is not an analysis of the dataset \emph{per se}, but the
understanding of the underlying process that generated it. Specifically, in
many applications $\mathcal{D}$ is a collection of samples obtained from an
unknown probability distribution $\pi$ on transactions, and by extracting the
FIs in $\mathcal{D}$ one attempts to infer itemsets that are frequently (i.e.,
with probability at least $\theta$) generated by $\pi$, which we call the True
Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the
generative process, the set of FIs is only a rough approximation of the set of
TFIs, as it often contains a huge number of \emph{false positives}, i.e.,
spurious itemsets that are not among the TFIs. In this work we design and
analyze an algorithm to identify a threshold $\hat{\theta}$ such that the
collection of itemsets with frequency at least $\hat{\theta}$ in $\mathcal{D}$
contains only TFIs with probability at least $1-\delta$, for some
user-specified $\delta$. Our method uses results from statistical learning
theory involving the (empirical) VC-dimension of the problem at hand. This
allows us to identify almost all the TFIs without including any false positive.
We also experimentally compare our method with the direct mining of
$\mathcal{D}$ at frequency $\theta$ and with techniques based on widely-used
standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and
show that our algorithm outperforms these methods and achieves even better
results than what is guaranteed by the theoretical analysis.
Comment: 13 pages. Extended version of work that appeared in the SIAM
International Conference on Data Mining, 2014.
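For contrast, here is a hedged Python sketch of the Chernoff/Hoeffding-plus-union-bound style baseline mentioned above (not the paper's VC-dimension method): the mining threshold is raised so that, across m candidate itemsets, an itemset with true frequency below $\theta$ is reported with probability at most $\delta$.

import math

def corrected_threshold(theta, n, m, delta):
    # Hoeffding: P(fhat - f >= eps) <= exp(-2 * n * eps^2). Setting the
    # right-hand side to delta / m (union bound over m itemsets) gives eps;
    # only itemsets with sample frequency >= theta + eps are reported.
    eps = math.sqrt(math.log(m / delta) / (2 * n))
    return theta + eps

print(corrected_threshold(theta=0.05, n=100_000, m=10_000, delta=0.1))  # ~0.0576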
An impossibility result for Markov Chain Monte Carlo sampling from micro-canonical bipartite graph ensembles
Markov Chain Monte Carlo (MCMC) algorithms are commonly used to sample from
graph ensembles. Two graphs are neighbors in the state space if one can be
obtained from the other with only a few modifications, e.g., edge rewirings.
For many common ensembles, e.g., those preserving the degree sequences of
bipartite graphs, rewiring operations involving two edges are sufficient to
create a fully-connected state space, and they can be performed efficiently. We
show that, for ensembles of bipartite graphs with fixed degree sequences and
number of butterflies ($K_{2,2}$ bi-cliques), there is no universal constant c such
that a rewiring of at most c edges at every step is sufficient for any such
ensemble to be fully connected. Our proof relies on an explicit construction of
a family of pairs of graphs with the same degree sequences and number of
butterflies, with each pair indexed by a natural number c, and such that any sequence
of rewiring operations transforming one graph into the other must include at
least one rewiring operation involving at least c edges. Whether rewiring this
many edges is sufficient to guarantee the full connectivity of the state space
of any such ensemble remains an open question. Our result implies the
impossibility of developing efficient, graph-agnostic, MCMC algorithms for
these ensembles, as the necessity to rewire an impractically large number of
edges may hinder taking a step on the state space.
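For reference, a small Python sketch (the graph representation is an assumption for illustration) that counts butterflies in a bipartite graph: every pair of right vertices sharing w common left-neighbors closes w(w-1)/2 butterflies.

from collections import Counter
from itertools import combinations

def count_butterflies(adj):
    """adj maps each left vertex to an iterable of its right neighbors."""
    wedges = Counter()  # (r1, r2) -> number of common left-neighbors
    for rights in adj.values():
        for r1, r2 in combinations(sorted(rights), 2):
            wedges[(r1, r2)] += 1
    # Every pair of left vertices sharing a right pair closes one butterfly.
    return sum(w * (w - 1) // 2 for w in wedges.values())

print(count_butterflies({1: ["a", "b"], 2: ["a", "b"], 3: ["b", "c"]}))  # 1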
RePBubLik: Reducing the Polarized Bubble Radius with Link Insertions
The topology of the hyperlink graph among pages expressing different opinions
may influence the exposure of readers to diverse content. Structural bias may
trap a reader in a polarized bubble with no access to other opinions. We model
readers' behavior as random walks. A node is in a polarized bubble if the
expected length of a random walk from it to a page of different opinion is
large. The structural bias of a graph is the sum of the radii of
highly-polarized bubbles. We study the problem of decreasing the structural
bias through edge insertions. Healing all nodes with high polarized bubble
radius is hard to approximate within a logarithmic factor, so we focus on
finding the best edges to insert to maximally reduce the structural bias.
We present RePBubLik, an algorithm that leverages a variant of the random walk
closeness centrality to select the edges to insert. RePBubLik obtains, under
mild conditions, a constant-factor approximation. It reduces the structural
bias faster than existing edge-recommendation methods, including some designed
to reduce the polarization of a graph.
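A hedged Monte Carlo sketch of the bubble-radius quantity (an illustration only; RePBubLik itself selects edges via a variant of random-walk closeness centrality, not reproduced here). It estimates the expected number of steps a random walk from a node takes to first reach a page with a different opinion; the graph and opinion encodings are assumptions for the example.

import random

def bubble_radius(graph, opinion, start, walks=1_000, max_steps=10_000):
    total = 0
    for _ in range(walks):
        node, steps = start, 0
        # Walk until a node with a different opinion is reached (or we give up).
        while opinion[node] == opinion[start] and steps < max_steps:
            node = random.choice(graph[node])
            steps += 1
        total += steps
    return total / walks

g = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "blue", 1: "blue", 2: "red"}
print(bubble_radius(g, labels, start=0))  # close to 4, the exact hitting time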
Space-Round Tradeoffs for MapReduce Computations
This work explores fundamental modeling and algorithmic issues arising in the
well-established MapReduce framework. First, we formally specify a
computational model for MapReduce which captures the functional flavor of the
paradigm by allowing for a flexible use of parallelism. Indeed, the model
diverges from a traditional processor-centric view by featuring parameters
which embody only global and local memory constraints, thus favoring a more
data-centric view. Second, we apply the model to the fundamental computational
task of matrix multiplication, presenting upper and lower bounds for both dense
and sparse matrix multiplication, which highlight interesting tradeoffs between
space and round complexity. Finally, building on the matrix multiplication
results, we derive further space-round tradeoffs on matrix inversion and
matching.
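To ground the map/shuffle/reduce structure the model abstracts, here is a single-machine Python sketch of one-round dense matrix multiplication in map-reduce style (illustrative only; the paper's memory parameters and bounds are not modeled).

from collections import defaultdict

def mapreduce_matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    # Map phase: emit one keyed product per (i, k, j) triple.
    pairs = [((i, j), A[i][k] * B[k][j])
             for i in range(n) for k in range(m) for j in range(p)]
    # Shuffle phase: group values by output key (i, j).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: sum each group into the output cell.
    C = [[0] * p for _ in range(n)]
    for (i, j), values in groups.items():
        C[i][j] = sum(values)
    return C

print(mapreduce_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]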
Mining Top-K Frequent Itemsets Through Progressive Sampling
We study the use of sampling for efficiently mining the top-K frequent
itemsets of cardinality at most w. To this end, we define an approximation
to the top-K frequent itemsets to be a family of itemsets which includes
(resp., excludes) all very frequent (resp., very infrequent) itemsets, together
with an estimate of these itemsets' frequencies with a bounded error. Our first
result is an upper bound on the sample size which guarantees that the top-K
frequent itemsets mined from a random sample of that size approximate the
actual top-K frequent itemsets, with probability larger than a specified value.
We show that the upper bound is asymptotically tight when w is constant. Our
main algorithmic contribution is a progressive sampling approach, combined with
suitable stopping conditions, which on appropriate inputs is able to extract
approximate top-K frequent itemsets from samples whose sizes are smaller than
the general upper bound. In order to test the stopping conditions, this
approach maintains the frequency of all itemsets encountered, which is
practical only for small w. However, we show how this problem can be mitigated
by using a variation of Bloom filters. A number of experiments conducted on
both synthetic and real benchmark datasets show that using samples
substantially smaller than the original dataset (i.e., of a size defined by the
upper bound or reached through the progressive sampling approach) enables
approximating the actual top-K frequent itemsets with accuracy much higher than
what is analytically proved.
Comment: 16 pages, 2 figures, accepted for presentation at ECML PKDD 2010 and
publication in the ECML PKDD 2010 special issue of the Data Mining and
Knowledge Discovery journal.
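A hedged Python sketch of the generic progressive-sampling loop (restricted to single items, i.e., w = 1, with a plain Hoeffding stopping condition chosen for illustration; the paper's stopping conditions and Bloom-filter variant are more refined): the sample grows geometrically until the uniform deviation bound over all items drops below a target eps.

import math
import random
from collections import Counter

def progressive_top_k(dataset, k, eps=0.01, delta=0.1, start=1_000, growth=2):
    items = {x for t in dataset for x in t}
    n = start
    while True:
        sample = random.sample(dataset, min(n, len(dataset)))
        counts = Counter(x for t in sample for x in t)
        # Hoeffding + union bound over all items: uniform half-width of a
        # (1 - delta)-confidence interval on every item's frequency.
        halfwidth = math.sqrt(math.log(2 * len(items) / delta) / (2 * len(sample)))
        if halfwidth <= eps or len(sample) == len(dataset):
            return counts.most_common(k)   # stopping condition satisfied
        n *= growth                        # enlarge the sample and retry

data = [{"a", "b"}, {"a"}, {"b", "c"}] * 5_000
print(progressive_top_k(data, k=2))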